ETL Games Pipeline

View on GitHub

This project is a Python ETL pipeline that extracts video game data from the RAWG API, transforms and normalises it into relational tables, loads it into PostgreSQL, and saves JSON snapshots for backup, to use data for analysis without making new API calls. My goal was to build a data pipeline from a live API and load it to structured database. Handle real-world challenges like nested JSON, many-to-many relationships, API rate limiting, and duplicate prevention.

Python PostgreSQL RAWG API Pandas SQLAlchemy psycopg2 JSON dotenv

Pipeline Overview

The pipeline collects games released between 2022 and 2024 across 7 parent platforms: PC, PlayStation, Xbox, iOS, Android, Nintendo, and Web. For each platform, up to 400 games are retrieved in pages of 40, ordered by ratings count. Data is extracted, flattened from nested API responses, and inserted into PostgreSQL and saved to Json files for backup for later analysis to prevent addtional API calls and handle limitations.

Terminal output showing pipeline running

Extract: RAWG API

Each API call fetches a page of game results filtered by date range and platform. The pipeline is automatically moving to the next page until the target count is reached or no more results are returned.

To respect rate limits, the pipeline pauses 3 seconds between every call and sleeps for 10 minutes after every 50 calls. Every successful call prints a confirmation to the console.

Transform: Flattening Nested Data

API responses return deeply nested JSON. Each game object contains arrays of genres, platforms, ratings, and ESRB data. pd.json_normalize() is used to flatten data into DataFrames. Columns are renamed to match the database schema, missing fields are filled with None, and duplicates are removed before inserting.

Bridge tables are built by merging the flattened genre and platform records with database-assigned IDs retrieved with pd.read_sql(), ensuring referential integrity across all tables.

Database Schema

Data is stored in a raw schema in PostgreSQL across 5 tables. The many-to-many relationships between games, genres, and platforms are managed through two bridge tables.

Entity Relationship Diagram showing all 5 tables

raw.games

ColumnTypeNotes
game_idSERIALPrimary key, auto-generated
rawg_game_idINT UNIQUEOriginal ID from RAWG API
game_nameTEXTTitle of the game
release_dateDATEOfficial release date
age_ratingTEXTESRB rating (e.g. Mature, Teen)
ratings_countINTNumber of user ratings on RAWG
ratingREALAverage user rating
created_atTIMESTAMPRow insertion timestamp

raw.genres

ColumnTypeNotes
genre_idSERIALPrimary key, auto-generated
rawg_genre_idINT UNIQUEOriginal ID from RAWG API
genre_nameTEXTe.g. Action, RPG, Strategy
created_atTIMESTAMPRow insertion timestamp

raw.platforms

ColumnTypeNotes
platform_idSERIALPrimary key, auto-generated
rawg_platform_idINT UNIQUEOriginal ID from RAWG API
platform_nameTEXTe.g. PC, PlayStation 5, iOS
created_atTIMESTAMPRow insertion timestamp

raw.bridge_games_genres

ColumnTypeNotes
game_idINTForeign key - raw.games
genre_idINTForeign key - raw.genres

raw.bridge_platforms_games

ColumnTypeNotes
game_idINTForeign key - raw.games
platform_idINTForeign key - raw.platforms

Load: PostgreSQL & JSON

Each record is inserted using ON CONFLICT DO NOTHING, making every run idempotent and safe to re-run without creating duplicates. After all inserts, the connection is committed and closed.

All collected records are saved as local JSON snapshots: games_raw.json, genres_raw.json, platforms_raw.json, bridge_games_genres.json, and bridge_platforms_games.json. These act as a backup of the database and allows analysis in without making new API calls.

pgAdmin showing query results from the games table

Conclusion

This project follows data engineering process: authenticating to an external API, dealing with paginated and nested JSON data, normalizing data to a relational format, dealing with many-to-many relationships with bridge tables, and loading data into PostgreSQL. The JSON snapshots create additional layer of data availability for analysis without needing a database connection or new API calls.

Data source:

RAWG Video Games Database API · Games released 2022–2024 · Collected via Python ETL pipeline